THE PROCESS GROUP APPROACH TO
RELIABLE DISTRIBUTED COMPUTING

Kenneth P. Birman

One might expect the reliability of a distributed system to
correspond directly to the reliability of its constituents, but
this is not always the case. The mechanisms used to structure a
distributed system and to implement cooperation between components
play a vital role in determining the reliability of the system.
Many contemporary distributed operating systems have placed
emphasis on communication performance, overlooking the need for
tools to integrate components into a reliable whole. The
communication primitives supported give generally reliable
behavior, but exhibit problematic semantics when transient
failures or system configuration changes occur. The resulting
building blocks are, therefore, unsuitable for facilitating the
construction of systems where reliability is important.

This article reviews 10 years of research on ISIS, a system that
provides tools to support the construction of reliable distributed
software. The thesis underlying ISIS is that development of
reliable distributed software can be simplified using process
groups and group programming tools. This article describes the
approach taken, surveys the system, and discusses experiences with
real applications.

It will be helpful to illustrate group programming and ISIS in a
setting where the system has found rapid acceptance: brokerage and
trading systems. These systems integrate large numbers of
demanding applications and require timely reaction to high volumes
of pricing and trading information.1 It is not uncommon for
brokers to coordinate trading activities across multiple markets.

Trading strategies rely on accurate pricing and market-volatility
data, dynamically changing databases giving the firm’s holdings in
various equities, news and analysis data, and elaborate financial
and economic models based on relationships between financial
instruments. Any distributed system in support of this application
must serve multiple communities: the firm as a whole, where
reliability and security are key considerations; the brokers, who
depend on speed and the ability to customize the trading
environment; and the system administrators, who seek uniformity,
ease of monitoring and control. A theme of this article is that
all of these issues revolve around the technology used to “glue
the system together.” By endowing the corresponding software layer
with predictable, fault-tolerant behavior, the flexibility and
reliability of the overall system can be greatly enhanced.

1 Although this class of systems certainly demands high
performance, there are no real-time deadlines or hard time
constraints, such as in the FAA’s Advanced Automation System [14].
This issue is discussed further in the section “ISIS and Other
Distributed Computing Technologies.”

Figure 1 illustrates a possible interface to a trading system. The
display is centered around the current position of the account
being traded, showing purchases and sales as they occur. A broker
typically authorizes purchases or sales of shares in a stock,
specifying limits on the price and the number of shares. These
instructions are communicated to the trading floor, where agents
of the brokerage or bank trade as many shares as possible,
remaining within this authorized window. The display illustrates
several points:

• Information backplane. The broker would construct such a display
by interconnecting elementary widgets (e.g., graphical windows,
computational widgets) so that the output of one becomes the input
to another. Seen in the large, this implies the ability to publish
messages and subscribe to messages sent from program to program on
topics that make up the “corporate information backplane” of the
brokerage. Such a backplane would support a naming structure,
communication interfaces, access restrictions, and some sort of
selective history mechanism. For example, when subscribing to a
topic, an application will often need key messages posted to that
topic in the past.

• Customization. The display suggests that the system must be
easily customized. The information backplane must be organized in
a systematic way (so that the broker can easily track down the
names of communication streams of interest) and flexible (allowing
the introduction of new communication streams while the system is
active).

• Hierarchical structure. Although the trader will treat the
wide-area system in a seamless way, communication disruptions are
far more common on wide-area links (say, from New York to Tokyo or
Zurich) than on local-area links. This gives the system a
hierarchical structure composed of local-area systems which are
closely coupled and rich in services, interconnected by less
reliable and higher-latency wide-area communication links.

What about the reliability implications of such an architecture?
In Figure 1, the trader has graphed a computed index of technology
stocks against the price of IBM, and it is easy to imagine that
such customization could include computations critical to the
trading strategy of the firm. In Figure 2, the analysis program is
“shadowed” by additional copies, to indicate that it has been made
fault-tolerant (i.e., it would remain available even if the
broker’s workstation failed). A broker is unlikely to be a
sophisticated programmer, so fault-tolerance such as this would
have to be introduced by the system, the trader’s only action
being to request it, perhaps by specifying the degree of
reliability needed for this analytic program. This means the
system must automatically replicate or checkpoint the computation,
placing the replicas on processors that fail independently from
the broker’s workstation, and activating a backup if the primary
fails.

The requirements of modern trading environments are not unique to
the application. It is easy to rephrase this example in terms of
the issues confronted by a team of seismologists cooperating to
interpret the results of a seismic survey under way in some remote
and inaccessible region, a doctor reviewing the status of patients
in a hospital from a workstation at home, a design group
collaborating to develop a new product, or application programs
cooperating in a factory-floor process control setting. The
software of a modern telecommunications switching product is faced
with many of the same issues, as is software implementing a
database that will be used in a large distributed setting. To
build applications for the networked environments of the future, a
technology is needed that will make it as easy to solve these
types of problems as it is to build graphical user interfaces
(GUIs) today.

A central premise of the ISIS project, shared with several other
efforts [2, 14, 19, 22, 25], is that support for programming with
distributed groups of cooperating programs is the key to solving
problems such as the ones previously mentioned. For example, a
fault-tolerant data analysis service can be implemented by a group
of programs that adapt transparently to failures and recoveries.
The publication/subscription style of interaction involves an
anonymous use of process groups: here, the group consists of a set
of publishers and subscribers that vary dramatically as brokers
change the instruments they trade. Each interacts with the group
through a group name (the topic), but the group membership is not
tracked or used within the computation. Although the processes
publishing or subscribing to a topic do not cooperate directly,
when this structure is employed, the reliability of the
application will depend on the reliability of group communication.
It is easy to see how problems could arise if, for example, two
brokers monitoring the same stock see different pricing
information.

Process groups of various kinds arise naturally throughout a
distributed system. Yet, current distributed computing
environments provide little support for group communication
patterns and programming. These issues have been left to the
application programmer, and application programmers have been
largely unable to respond to the challenge. In short, contemporary
distributed computing environments prevent users from realizing
the potential of the distributed computing infrastructure on which
their applications run.

Process  Groups 

Two  styles  of  process  group  usage 
are  seen  in  most  ISIS  applications: 

Anonymous groups: These arise when an application publishes data
under some “topic,” and other processes subscribe to that topic.
For an application to operate automatically and reliably,
anonymous groups should provide certain properties:

1. It should be possible to send messages to the group using a
group address. The high-level programmer should not be involved in
expanding the group address into a list of destinations.

2. If the sender and subscribers remain operational, messages
should be delivered exactly once. If the sender fails, a message
should be delivered to all or none of the subscribers. The
application programmer should not need to worry about message loss
or duplication.

3. Messages should be delivered to subscribers in some sensible
order. For example, one would expect messages to be delivered in
an order consistent with causal dependencies: if a message m is
published by a program that first received m_1 ... m_i, then m
might be dependent on these prior messages. If some other
subscriber will receive m as well as one or more of these prior
messages, one would expect them to be delivered first. Stronger
ordering properties might also be desired, as discussed later.

4. It should be possible for a subscriber to obtain a history of
the group: a log of key events and the order in which they were
received.2 If n messages are posted and the first message seen by
a new subscriber will be message m_i, one would expect messages
m_1 ... m_{i-1} to be reflected in the history, and messages
m_i ... m_n to all be delivered to the new process. If some
messages are missing from the history, or included both in the
history and in the subsequent postings, incorrect behavior might
result.

2 The application itself would distinguish messages that need to
be retained from those that can be discarded.
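
The join-time requirement on the history can be made concrete with a small sketch. The Python below is a single-process toy, not ISIS code: the class Topic and its methods publish and subscribe are hypothetical names introduced only for illustration. It assumes that taking the history snapshot and registering the new subscriber happen as one indivisible step, which is exactly the property a real distributed implementation has to work to provide.

```python
class Topic:
    """Hypothetical in-memory topic: a retained history plus live subscribers."""

    def __init__(self):
        self.history = []       # retained ("key") messages, in posting order
        self.subscribers = {}   # subscriber name -> list of delivered messages

    def publish(self, message, keep=True):
        # The application decides which messages are worth retaining.
        if keep:
            self.history.append(message)
        for inbox in self.subscribers.values():
            inbox.append(message)

    def subscribe(self, name):
        # Snapshot the history and register the inbox in one step, so that
        # every message lands in the snapshot or in the inbox, never in both.
        snapshot = list(self.history)
        self.subscribers[name] = []
        return snapshot, self.subscribers[name]


topic = Topic()
for m in ["m1", "m2"]:
    topic.publish(m)

past, inbox = topic.subscribe("new-broker")   # joins before m3 is posted

for m in ["m3", "m4"]:
    topic.publish(m)

# past holds m1 and m2; inbox holds m3 and m4; nothing is lost or repeated.
```

If the snapshot and the registration were separated by a concurrent publish, a message could be missing from both or present in both, which is the incorrect behavior described in item 4.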

Explicit  groups:  A  group  is  explicit 
when  its  members  cooperate  directly: 
they  know  themselves  to  be  members 
of  the  group,  and  employ  algorithms 
that  incorporate  the  list  of  members, 
relative  rankings  within  the  list,  or  in 
which  responsibility  for  responding 
to  requests  is  shared. 

Explicit groups have additional needs stemming from their use of
group membership information: in some sense, membership changes
are among the information being published to an explicit group.
For example, a fault-tolerant service might have a primary member
that takes some action and an ordered set of backups that take
over, one by one, if the current primary fails. Here, group
membership changes (failure of the primary) trigger actions by
group members. Unless the same changes are seen in the same order
by all members, situations could arise in which there are no
primaries, or several. Similarly, a parallel database search might
be done by ranking the group members and then dividing the
database into n parts, where n is the number of group members.
Each member would do 1/n'th of the work, with the ranking
determining which member handles which fragment of the database.
The members need consistent views of the group membership to
perform such a search correctly; otherwise, two processes might
search the same part of the database while some other part remains
unscanned, or they might partition the database inconsistently.
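
A short sketch shows why the ranked search depends on agreement about membership. It is illustrative only: the function names my_fragment and current_primary, the member names, and the choice of ten records are assumptions made for this example, and a membership view is modeled as nothing more than an ordered Python list.

```python
def my_fragment(view, me, n_records):
    """Record indices that member `me` searches, given its membership view."""
    rank = view.index(me)          # position in the ranked member list
    n = len(view)                  # number of members in this view
    return range(rank * n_records // n, (rank + 1) * n_records // n)


def current_primary(view):
    """The lowest-ranked surviving member acts as primary."""
    return view[0]


# All members hold the same view: every record is searched exactly once.
agreed = ["p", "q", "r"]
covered = sorted(i for member in agreed for i in my_fragment(agreed, member, 10))

# Member q still holds a stale two-member view: some records are searched
# twice and others are never searched.
views = {"p": agreed, "q": ["p", "q"], "r": agreed}
searched = [i for member, view in views.items()
            for i in my_fragment(view, member, 10)]
missed = sorted(set(range(10)) - set(searched))
```

With the agreed view the fragments tile the ten records exactly. With the stale view at q, records 3 and 4 are never scanned while records 6 through 9 are scanned by both q and r.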

Thus, a number of technical problems must be considered in
developing software for implementing distributed process groups:

• Support for group communication, including addressing, failure
atomicity, and message delivery ordering.

• Use of group membership as an input. It should be possible to
use the group membership or changes in membership as input to a
distributed algorithm (one run concurrently by multiple group
members).

• Synchronization. To obtain globally correct behavior from group
applications, it is necessary to synchronize the order in which
actions are taken, particularly when group members will act
independently on the basis of dynamically changing, shared
information.

The first and last of these problems have received considerable
study. However, the problems cited are not independent: their
integration within a single framework is nontrivial. This
integration issue underlies our virtual synchrony execution model.

Building Distributed Services
over Conventional Technologies

In this section we review the technical issues raised in the
preceding section. In each case, we start by describing the
problem as it might be approached by a developer working over a
contemporary computing system, with no special tools for group
programming. Obstacles to solving the problems are identified, and
used to motivate a general approach to overcoming the problem in
question. Where appropriate, the actual approach used in solving
the problem within ISIS is discussed.

Figure 1. Broker’s trading system

Figure 2. Making an analytic service fault-tolerant

Figure 3. Inconsistent connection states

Conventional Message-Passing
Technologies

Contemporary  operating  systems 
offer  three  classes  of  communication 
services  [34]: 

• Unreliable datagrams: These services automatically discard
corrupted messages, but do little additional processing. Most
messages get through, but under some conditions messages might be
lost in transmission, duplicated, or delivered out of order.

• Remote procedure call: In this approach, communication results
from a procedure invocation that returns a result. RPC is a
relatively reliable service, but when a failure does occur, the
sender is unable to distinguish among many possible outcomes: the
destination may have failed before or after receiving the request,
or the network may have prevented or delayed delivery of the
request or the reply.

• Reliable data streams: Here, communication is performed over
channels that provide flow control and reliable, sequenced message
delivery. Standard stream protocols include TCP, the ISO
protocols, and TP4. Because of pipelining, streams generally
outperform RPC when an application sends large volumes of data.
However, the standards also prescribe rules under which a stream
will be broken, using conditions based on timeout or excessive
retransmissions. For example, suppose that processes c, s1 and s2
have connections with one another; perhaps s1 and s2 are the
primary and backup, respectively, for a reliable service of which
c is a client.

Now, consider the state of this system if the connection from c to
s1 breaks due to a communication failure, while all three
processes and the other two connections remain operational
(Figure 3). Much like the situation after a failed RPC, c and s1
will now be uncertain regarding one another’s status. Worse, s2 is
totally unaware of the problem. In such a situation, the
application may easily behave in an inconsistent manner. In our
primary-backup example, c would cease sending requests to s1,
expecting s2 to handle them. Process s2, however, will not respond
(it expects s1 to do so).

In a system with more components, the situation would be greatly
exacerbated. From this, one sees that a reliable data stream has
guarantees little stronger than an unreliable one: when channels
break, it is not safe to infer that either endpoint has failed;
channels may not break in a consistent manner, and data in transit
may be lost. Because the conditions under which a stream breaks
are defined by the standards, one has a situation in which
potentially inconsistent behavior is unavoidable.

These considerations lead us to make a collection of assumptions
about the network and message communication in the remainder of
the article. First, we will assume the system is structured as a
wide-area network (WAN) composed of local-area networks (LANs)
interconnected by wide-area communication links. (WAN issues will
not be considered in this article due to space constraints.) We
assume that each LAN consists of a collection of machines (as few
as two or three, or as many as one or two hundred), connected by a
collection of high-speed, low-latency communication devices. If
shared memory is employed, we assume it is not used over the
network. Clocks are not assumed to be closely synchronized.

Within a LAN, we assume messages may be lost in transit, arrive
out of order, be duplicated, or be discarded because of inadequate
buffering capacity. We also assume that LAN communication
partitions are rare. The algorithms described later in this
article and the ISIS system itself may pause (or make progress in
only the largest partition) during periods of partition failure,
resuming normal operation only when normal communication is
restored.

We will assume the lowest levels of the system are responsible for
flow control and for overcoming message loss and unordered
delivery. In ISIS, these tasks are accomplished using a windowed
acknowledgement protocol similar to the one used in TCP, but
integrated with a failure-detection subsystem. With this
(nonstandard) approach, a consistent system-wide view of the state
of components in the system and of the state of communication
channels between them can be presented to higher layers of
software. For example, the ISIS transport layer will only break a
communication channel to a process in situations in which it would
also report to any application monitoring that process that the
process has failed. Moreover, if one channel to a process is
broken, all channels are broken.

Failure  Model 

Throughout this article, processes and processors are assumed to
fail by halting, without initiating erroneous actions or sending
incorrect messages. This raises a problem: transient problems,
such as an unresponsive swapping device or a temporary
communication outage, can mimic halting failures. Because we will
want to build systems guaranteed to make progress when failures
occur, this introduces a conflict between “accurate” and “timely”
failure detection.

One way ISIS overcomes this problem is by integrating the
communication transport layer with the failure detection layer to
make processes appear to fail by halting, even when this may not
be the case: a fail-stop model [30]. To implement such a model, a
system uses an agreement protocol to maintain a system membership
list: only processes included in this list are permitted to
participate in the system, and nonresponsive or failed processes
are dropped [12, 28]. If a process dropped from the list later
resumes communication, the application is forced to either shut
down gracefully or to run a “reconnection” protocol. The message
transport layer plays an important role, both by breaking
connections and by intercepting messages from faulty processes.
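
The role of the transport layer in this model can be pictured with a toy filter. This is a sketch under strong assumptions, not the ISIS implementation: the class FailStopTransport and its methods are hypothetical, the membership list is a local Python set, and the agreement protocol that decides when to drop a process is replaced by a direct method call.

```python
class FailStopTransport:
    """Toy transport that admits traffic only from listed processes."""

    def __init__(self, members):
        self.members = set(members)   # stands in for the agreed membership list
        self.delivered = []

    def drop(self, process):
        # In a real system this decision is reached by an agreement protocol
        # among the surviving processes; here it is a plain local call.
        self.members.discard(process)

    def receive(self, sender, message):
        # Messages from a process outside the list are intercepted, so the
        # dropped process appears to have halted even if it is still running.
        if sender not in self.members:
            return False
        self.delivered.append((sender, message))
        return True


transport = FailStopTransport(["a", "b", "c"])
accepted_before = transport.receive("b", "hello")

transport.drop("b")                   # b was unresponsive for too long
accepted_after = transport.receive("b", "late reply")
```

After the drop, the late reply from b is discarded rather than delivered, so higher layers never observe activity from a process they have been told has failed.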

In the remainder of this article we assume a message transport and
failure-detection layer with the properties of the one used by
ISIS. To summarize, a process starts execution by joining the
system, interacts with it over a period of time during which
messages are delivered in the order sent, without loss or
duplication, and then terminates (if it terminates) by halting
detectably. Once a process terminates, we will consider it to be
permanently gone from the system, and assume that any state it may
have recorded (say, on a disk) ceases to be relevant. If a process
experiences a transient problem and then recovers and rejoins the
system, it is treated as a completely new entity: no attempt is
made to automatically reconcile the state of the system with its
state prior to the failure (recovery of this nature is left to
higher layers of the system and applications).

Building Groups Over Conventional
Technologies

Group Addressing. Consider the problem of mapping a group address
to a membership list, in an application in which the membership
could change dynamically due to processes joining the group or
leaving. The obvious way to approach this problem involves a
membership service [9, 12]. Such a service maintains a map from
group identifiers to membership lists. Deferring fault-tolerance
issues, one could implement such a service using a simple program
that supports remotely callable procedures to register a new group
or group member, obtain the membership of a group, and perhaps
forward a message to the group. A process could then transmit a
message either by forwarding it via the naming service, or by
looking up the membership information, caching it, and
transmitting messages directly.3 The first approach will perform
better for one-time interactions; the second would be preferable
in an application that sends a stream of messages to the group.
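
The two transmission styles can be sketched as follows. The sketch is a single-process stand-in for what would really be a remotely callable service: the class MembershipService, its methods register, lookup and forward, the group name "quotes", and the deliver callback are all hypothetical, and fault tolerance is ignored, as in the text.

```python
class MembershipService:
    """Hypothetical stand-in for a remotely callable membership service."""

    def __init__(self):
        self._groups = {}                 # group identifier -> member list

    def register(self, group, member):
        members = self._groups.setdefault(group, [])
        if member not in members:
            members.append(member)

    def lookup(self, group):
        return list(self._groups.get(group, []))   # a copy the caller may cache

    def forward(self, group, message, deliver):
        for member in self.lookup(group):
            deliver(member, message)


inboxes = {"s1": [], "s2": [], "s3": []}

def deliver(member, message):
    inboxes[member].append(message)

service = MembershipService()
service.register("quotes", "s1")
service.register("quotes", "s2")

# Style 1: a one-time interaction, forwarded by the service.
service.forward("quotes", "IBM 101", deliver)

# Style 2: look up once, cache, then send a stream of messages directly.
cached = service.lookup("quotes")
service.register("quotes", "s3")          # membership changes after the lookup
for message in ["IBM 102", "IBM 103"]:
    for member in cached:
        deliver(member, message)
```

In this run s3 joins after the sender has cached the membership, so it receives none of the directly transmitted messages. The cached copy has no way to learn of the change unless some invalidation mechanism is added.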

3 In the latter case, one would also need a mechanism for
invalidating cached addressing information when the group
membership changes (this is not a trivial problem, but the need
for brevity precludes discussing it in detail).

This form of addressing also raises a scheduling question. The
designer of a distributed application will want to send messages
to all members of the group, under some reasonable interpretation
of the term “all.” The question, then, is how to schedule the
delivery of messages so that the delivery is to a reasonable set
of processes. For example, suppose that a process group contains
three processes, and a process sends many messages to it. One
would expect these messages to reach all three members, not some
other set reflecting a stale view of the group composition (e.g.,
including processes that have left the group).

The solution to this problem favored in our work can be understood
by thinking of the group membership as data in a database shared
by the sender of a multidestination message (a multicast4), and
the algorithm used to add new members to the group. A multicast
“reads” the membership of the group to which it is sent, holding a
form of read-lock until the delivery of the message occurs. A
change of membership that adds a new member would be treated like
a “write” operation, requiring a write-lock that prevents such an
operation from executing while a prior multicast is under way. It
will now appear that messages are delivered to groups only when
the membership is not changing.
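
The read-lock and write-lock analogy can be simulated in a few lines. The sketch below is single-threaded and purely illustrative: the class Group and its methods begin_multicast, deliver and join are hypothetical names, a "read-lock" is modeled as an entry in a table of in-flight multicasts, and a "write" is a join that is deferred until that table is empty. Fairness between readers and writers is not addressed.

```python
class Group:
    """Toy model: multicasts read the membership, joins write it."""

    def __init__(self, members):
        self.members = list(members)
        self.in_flight = {}        # multicast id -> destinations it "read"
        self.waiting_joins = []    # joins blocked behind in-flight multicasts

    def begin_multicast(self, msg_id):
        # "Read" the membership and hold the read-lock until delivery.
        self.in_flight[msg_id] = list(self.members)

    def deliver(self, msg_id):
        destinations = self.in_flight.pop(msg_id)   # release the read-lock
        if not self.in_flight:
            # No readers remain, so deferred "writes" may now take effect.
            self.members.extend(self.waiting_joins)
            self.waiting_joins.clear()
        return destinations

    def join(self, process):
        if self.in_flight:
            self.waiting_joins.append(process)      # wait for the write-lock
        else:
            self.members.append(process)


group = Group(["s1", "s2"])
group.begin_multicast("m1")
group.join("s3")                   # arrives while m1 is still under way
first = group.deliver("m1")

group.begin_multicast("m2")
second = group.deliver("m2")
```

The multicast m1 reaches exactly the membership it read, and the join takes effect only between m1 and m2, so each message appears to be delivered while the membership is not changing.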

A problem with using locking to implement address expansion is
cost. Accordingly, ISIS uses this idea, but does not employ a
database or any sort of locking. And, rather than implement a
membership server, which could represent a single point of
failure, ISIS replicates knowledge of the membership among the
members of the group itself. This is done in an integrated manner,
in order to perform address expansion with no extra messages or
unnecessary delays and to guarantee the logical instantaneity
property that the user expects. For practical purposes, any
message sent to a group can be thought of as reaching all members
at the same time.


4 In this article the term multicast refers to sending a single
message to the members of a process group. The term broadcast,
common in the literature, is sometimes confused with the hardware
broadcast capabilities of devices like Ethernet. While a multicast
might make use of hardware broadcast, this would simply represent
one possible implementation strategy.


Logical time and causal dependency. The phrase “reaching all of
its members at the same time” raises an issue that will prove to
be fundamental to message-delivery ordering. Such a statement
presupposes a temporal model. What notion of time applies to
distributed process group applications?

In 1978, Leslie Lamport published a seminal paper that considered
the role of time in distributed algorithms [21]. Lamport asked how
one might assign timestamps to the events in a distributed system
to correctly capture the order in which events occurred. Real time
is not suitable for this: each machine will have its own clock,
and clock synchronization is at best imprecise in distributed
systems. Moreover, operating systems introduce unpredictable
software delays, processor execution speeds can vary widely due to
cache affinity effects, and scheduling is often unpredictable.
These factors make it difficult to compare timestamps assigned by
different machines.

As an alternative, Lamport suggested, one could discuss
distributed algorithms in terms of the dependencies between the
events making up the system execution. For example, suppose a
process first sets some variable x to 3, and then sets y = x. The
event corresponding to the latter operation would depend on the
former one, an example of a local dependency. Similarly, receiving
a message depends on sending it. This view of a system leads one
to define the potential causality relationship between events in
the system. It is the irreflexive transitive closure of the
message send-receive relation and the local dependency relation
for processes in the system. If event a happens before event b in
a distributed system, the causality relation will capture this.

In Lamport’s view of time, we would say that two events are
concurrent if they are not causally related: the issue is not
whether they actually executed simultaneously in some run of the
system, but whether the system was sensitive to their respective
ordering. Given an execution of a system, there exists a large set
of equivalent executions arrived at by rescheduling concurrent
events while retaining the event ordering constraints represented
by the causality relation. The key observation is that the causal
event ordering captures all the essential ordering information
needed to describe the execution: any two physical executions with
the same causal event ordering describe indistinguishable runs of
the system.
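
Both definitions can be exercised on a tiny execution. The sketch below computes the potential causality relation as a transitive closure over directly dependent event pairs and then tests pairs of events for concurrency. The function names happens_before and concurrent, and the event labels, are hypothetical choices for this example; the closure is computed by naive repeated joining, which is adequate only for small event sets.

```python
from itertools import product


def happens_before(direct):
    """Transitive closure of a set of (earlier, later) event pairs."""
    closure = set(direct)
    changed = True
    while changed:
        changed = False
        # Join every pair (a, b) with every pair (b, d) to obtain (a, d).
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def concurrent(a, b, hb):
    """Two distinct events are concurrent if neither causally precedes the other."""
    return a != b and (a, b) not in hb and (b, a) not in hb


local = {("p: x = 3", "p: y = x"),        # local dependencies within a process
         ("p: y = x", "p: send m"),
         ("q: recv m", "q: print")}
messages = {("p: send m", "q: recv m")}   # receiving depends on sending

hb = happens_before(local | messages)

crosses_processes = ("p: x = 3", "q: print") in hb
unrelated = concurrent("r: work", "q: print", hb)
```

The assignment at p causally precedes the print at q through the message m, whereas the event at r is related to nothing and is therefore concurrent with every other event, regardless of when it actually ran.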

Recall our use of the phrase “reaching all of its members at the
same time.” Lamport has suggested that for a system described in
terms of a causal event ordering, any set of concurrent events,
one per process, can be thought of as representing a logical
instant in time. Thus, when we say that all members of a group
receive a message at the same time, we mean that the message
delivery events are concurrent and totally ordered with respect to
group membership change events. Causal dependency provides the
fundamental notion of time in a distributed system, and plays an
important role in the remainder of this section.

Message  delivery  ordering.  Con¬ 
sider  Figure  4,  part  (A),  in  which 
messages  m\9  m2,  m$  and  m4  are  sent 
to  a  group  consisting  of  processes  s i, 
s2,  and  53.  Messages  m\  and  m2  are 
sent  concurrently  and  are  received  in 
different  orders  by  s2  and  53.  In  many 
applications,  52  and  53  would  behave 
in  an  uncoordinated  or  inconsistent 
manner  if  this  occurred.  A  designer 
must,  therefore,  anticipate  possible 
inconsistent  message  ordering.  For 
example,  one  might  design  the  appli¬ 
cation  to  tolerate  such  mixups,  or 
explicitly prevent them from occurring
by delaying the processing of m1
and m2 within the program until an
ordering  has  been  established.  The 
real  danger  is  that  a  designer  could 
overlook  the  whole  issue — after  all, 
two  simultaneous  messages  to  the 
program  that  arrive  in  different  se¬ 
quences  may  seem  like  an  improb¬ 
able  scenario — yielding  an  applica¬ 
tion  that  usually  is  correct,  but  may 
exhibit  abnormal  behavior  when  un¬ 
likely  sequences  of  events  occur,  or 
under  periods  of  heavy  load.  (Under 
load,  multicast  delivery  latencies  rise, 
increasing  the  probability  that  con¬ 
current  multicasts  could  overlap). 

This  is  only  one  of  several  delivery 
ordering problems illustrated in Figure 4.
Consider the situation when s3
receives message m3. Message m3 was
sent by s1 after receiving m2, and
might refer to or depend on m2. For
example,  m2  might  authorize  a  cer¬ 
tain  broker  to  trade  a  particular  ac¬ 
count,  and  m3  could  be  a  trade  the 
broker  has  initiated  on  behalf  of  that 
account. Our execution is such that s3
has  not  yet  received  m2  when  m3  is 
delivered.  Perhaps  m2  was  discarded 
by  the  operating  system  due  to  a  lack 
of  buffering  space.  It  will  be  retrans¬ 
mitted,  but  only  after  a  brief  delay 
during  which  m3  might  be  received. 

Why  might  this  matter?  Imagine 
that s3 is displaying buy/sell orders on
the trading floor. s3 will consider m3
invalid,  since  it  will  not  be  able  to 
confirm  that  the  trade  was  autho¬ 
rized.  An  application  with  this  prob¬ 
lem  might  fail  to  carry  out  valid  trad¬ 
ing  requests.  Again,  although  the 
problem  is  solvable,  the  question  is 
whether  the  application  designer  will 
have  anticipated  the  problem  and 
programmed  a  correct  mechanism  to 
compensate  when  it  occurs. 

In  our  work  on  ISIS,  this  problem 
is  solved  by  including  a  context  rec¬ 
ord  on  each  message.  If  a  message 
arrives  out  of  order,  this  record  can 
be  used  to  detect  the  condition,  and 
to  delay  delivery  until  prior  messages 
arrive.  The  context  representation 
we  employ  has  size  linear  in  the  num¬ 
ber  of  members  of  the  group  within 
which  the  message  is  sent  (actually,  in 
the  worst  case  a  message  might  carry 
multiple  such  context  records,  but 
this  is  extremely  rare).  However,  the 
average  size  can  be  greatly  reduced 
by  taking  advantage  of  repetitious 
communication  patterns,  such  as  the 
tendency  of  a  process  that  sends  to  a 
group  to  send  multiple  messages  in 
succession  [11].  The  imposed  over¬ 
head  is  variable,  but  on  the  average 
small.  Other  solutions  to  this  prob¬ 
lem  are  described  in  [9,  26]. 
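
One representation whose size is linear in the number of group members is a vector of per-member counters. The Python sketch below assumes that representation in order to show how a context record lets a receiver detect an early message and hold it back. The class and method names are hypothetical, and the sketch does not reproduce the ISIS record format or the compression techniques of [11].

```python
from typing import NamedTuple


class Message(NamedTuple):
    sender: int     # index of the sending group member
    stamp: tuple    # context record: one counter per group member
    body: str


class Member:
    def __init__(self, index: int, size: int):
        self.index = index
        self.clock = [0] * size   # messages delivered so far, per sender
        self.pending = []         # received but not yet deliverable
        self.delivered = []       # bodies handed to the application, in order

    def send(self, body: str) -> Message:
        self.clock[self.index] += 1
        self.delivered.append(body)           # the sender delivers locally at once
        return Message(self.index, tuple(self.clock), body)

    def _deliverable(self, msg: Message) -> bool:
        for k, count in enumerate(msg.stamp):
            if k == msg.sender:
                if count != self.clock[k] + 1:   # must be the sender's next message
                    return False
            elif count > self.clock[k]:          # depends on a message not yet seen
                return False
        return True

    def receive(self, msg: Message) -> None:
        self.pending.append(msg)
        progress = True
        while progress:                          # keep draining newly unblocked messages
            progress = False
            for m in list(self.pending):
                if self._deliverable(m):
                    self.pending.remove(m)
                    self.clock[m.sender] = m.stamp[m.sender]
                    self.delivered.append(m.body)
                    progress = True


def broker_scenario():
    """Replay the example: s3 receives the trade (m3) before the authorization (m2)."""
    s1, s2, s3 = (Member(i, 3) for i in range(3))
    m2 = s2.send("m2: authorize broker")
    s1.receive(m2)
    m3 = s1.send("m3: trade")
    s3.receive(m3)                 # arrives early and is held back
    held = list(s3.delivered)
    s3.receive(m2)                 # the retransmitted m2 finally arrives
    return held, s3.delivered
```

In `broker_scenario`, nothing is delivered at s3 while m3 waits; once m2 arrives, both are delivered in causal order, so the trade is never seen without its authorization.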

Message  m4  exhibits  a  situation 
that  combines  several  of  these  issues. 
m4  is  sent  by  a  process  that  previously 
sent m1 and is concurrent with m2, m3,
and  a  membership  change  of  the 
group.  One  sees  here  a  situation  in 
which  all  of  the  ordering  issues  cited 
thus  far  arise  simultaneously,  and  in 
which  failing  to  address  any  of  them 
could  lead  to  errors  within  an  impor¬ 
tant  class  of  applications.  As  shown, 
only the group addressing property
proposed in the previous section is
violated: were m4 to trigger a concurrent
database search, process s1
would  search  the  first  third  of  the 
database,  while  s2  searches  the  sec¬ 
ond  half — one-sixth  of  the  database 
would  not  be  searched.  However,  the 
figure  could  easily  be  changed  to 
simultaneously  violate  other  order¬ 
ing  properties. 
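
The one-sixth figure follows from simple arithmetic. The short Python sketch below assumes, for illustration, that each member searches an equal slice chosen by its rank in the group as it currently sees it, that s1 sees three members, and that s2 sees two; the helper name `share` is hypothetical.

```python
from fractions import Fraction


def share(rank: int, group_size: int):
    """Slice [start, end) of the database searched by the member with 0-based `rank`."""
    return Fraction(rank, group_size), Fraction(rank + 1, group_size)


s1_slice = share(0, 3)   # s1 believes the group has three members: first third
s2_slice = share(1, 2)   # s2 believes the group has two members: second half

# Nothing covers the interval between the end of s1's slice and the start of s2's.
gap = s2_slice[0] - s1_slice[1]
```

Here `gap` is 1/2 - 1/3 = 1/6: the portion of the database that no process searches because the two members expanded the group address differently.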

State  transfer.  Figure  4,  part  (B) 
illustrates  a  slightly  different  prob¬ 
lem.  Here,  we  wish  to  transfer  the 
state of the service to process s3: perhaps
s3 represents a program that
has  restarted  after  a  failure  (having 
lost  prior  state)  or  a  server  that  has 
been  added  to  redistribute  load.  In¬ 
tuitively,  the  state  of  the  server  will 
be  a  data  structure  reflecting  the  data 
managed  by  the  service,  as  modified 
by  the  messages  received  prior  to 
when  the  new  member  joined  the 
group.  However,  in  the  execution 
shown,  a  message  has  been  sent  to 
the  server  concurrent  with  the  mem¬ 
bership  change.  A  consequence  is 
that s3 receives a state which does not
reflect message m4, leaving it inconsistent
with s1 and s2. Solving this
problem  involves  a  complex  synchro¬ 
nization  algorithm  (not  presented 
here),  probably  beyond  the  ability  of 
a  typical  distributed  applications  pro¬ 
grammer. 

Fault  tolerance.  Up  to  now,  our 
discussion  has  ignored  failures.  Fail¬ 
ures  cause  many  problems;  here,  we 
consider  just  one.  Suppose  the 
sender  of  a  message  were  to  crash 
after  some,  but  not  all,  destinations 
receive  the  message.  The  destina¬ 
tions  that  do  have  a  copy  will  need  to 
complete  the  transmission  or  discard 
the  message.  The  protocol  used 
should  achieve  “exactly-once  deliv¬ 
ery”  of  each  message  to  those  desti¬ 
nations  that  remain  operational,  with 
bounded  overhead  and  storage. 
Conversely,  we  need  not  be  con¬ 
cerned  with  delivery  to  a  process  that 
fails  during  the  protocol,  since  such  a 
process  will  never  be  heard  from 
again  (recall  the  fail-stop  model). 

Protocols  to  solve  this  problem  can 
be  complex,  but  a  fairly  simple  solu¬ 
tion  will  illustrate  the  basic  tech¬ 
niques.  This  protocol  uses  three 
rounds of RPCs as illustrated in Figure 5.
During the first round, the
sender sends the message to the destinations,
which acknowledge receipt.
Although the destinations can
deliver  the  message  at  this  point,  they 
need  to  keep  a  copy:  should  the 
sender  fail  during  the  first  round, 
the  destination  processes  that  have 
received  copies  will  need  to  finish  the 
protocol  on  the  sender’s  behalf.  In 
the  second  round,  if  no  failure  has 
occurred,  then  the  sender  tells  all 
destinations  that  the  first  round  has 
finished.  They  acknowledge  this 
message  and  make  a  note  that  the 
sender  is  entering  the  third  round. 
During  the  third  round,  each  desti¬ 
nation  discards  all  information  about 
the  message — deleting  the  saved 
copy  of  the  message  and  any  other 
data  it  was  maintaining. 

When  a  failure  occurs,  a  process 
that  has  received  a  first-  or  second- 
round  message  can  terminate  the 
protocol.  The  basic  idea  is  to  have 
some  member  of  the  destination  set 
take  over  the  round  that  the  sender 
was  running  when  it  failed;  processes 
that  have  already  received  messages 
in  that  round  detect  duplicates  and 
respond  to  them  as  they  responded 
after  the  original  reception.  The  pro¬ 
tocol  is  straightforward,  and  we  leave 
the  details  to  the  reader. 
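
One way the omitted details might be filled in is sketched below in Python. This is an illustration under stated assumptions, not the ISIS implementation: RPCs are modelled as direct method calls, failures are assumed to be fail-stop and reported consistently, and the names `Destination`, `run_rounds`, and `take_over` are hypothetical.

```python
class Destination:
    def __init__(self):
        self.delivered = []   # bodies handed to the application
        self.saved = {}       # msg_id -> {"body": ..., "round": ...}

    def round1(self, msg_id, body):
        if msg_id not in self.saved:          # first copy: deliver it and keep it
            self.saved[msg_id] = {"body": body, "round": 1}
            self.delivered.append(body)
        return "ack"                          # a duplicate is simply acknowledged again

    def round2(self, msg_id):
        if msg_id in self.saved:              # note that round one has finished
            self.saved[msg_id]["round"] = 2
        return "ack"

    def round3(self, msg_id):
        self.saved.pop(msg_id, None)          # discard all information about the message


def run_rounds(msg_id, body, destinations, start_round=1):
    """Run the protocol from `start_round` to completion, one RPC at a time."""
    if start_round <= 1:
        for d in destinations:
            d.round1(msg_id, body)
    if start_round <= 2:
        for d in destinations:
            d.round2(msg_id)
    for d in destinations:
        d.round3(msg_id)


def take_over(survivor, msg_id, destinations):
    """A destination holding a copy resumes the round the failed sender was running."""
    record = survivor.saved[msg_id]
    run_rounds(msg_id, record["body"], destinations, start_round=record["round"])
```

If the sender crashes after reaching only one destination in the first round, that destination can call `take_over`; the remaining destinations then receive the message once, and the survivor recognizes its own first-round message as a duplicate. The sketch depends on accurate failure reports: a first-round message replayed after a destination has already discarded its record would be delivered twice.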

This  three-round  multicast  proto¬ 
col  does  not  obtain  any  form  of  pipe¬ 
lined  or  asynchronous  data  flow 
when  invoked  many  times  in  succes¬ 
sion,  and  the  use  of  RPC  limits  the 
degree  of  communication  concur¬ 
rency  during  each  round  (it  would  be 
better  to  send  all  the  messages  at 
once,  and  to  collect  the  replies  in  par¬ 
allel).  These  features  make  the  pro¬ 
tocol  expensive.  Much  better  solu¬ 
tions  have  been  described  in  the 
literature (see [9, 11] for more detail
on  the  approach  used  in  ISIS,  and 
for  a  summary  of  other  work  in  the 
area). 

Recall  that  in  the  subsection  “Con¬ 
ventional  Message-Passing  Technol¬ 
ogies,”  we  indicated  that  systemwide 
agreement  on  membership  was  an 
important  property  of  our  overall 
approach.  It  is  interesting  to  realize 
that  a  protocol  such  as  this  is  greatly 
simplified  because  failures  are  re¬ 
ported  consistently  to  all  processes  in 
the  system.  If  failure  detection  were 
by  an  inconsistent  mechanism,  it 
would be very difficult to convince
oneself that the protocol is correct
(indeed,  as  stated,  the  protocol  could 
deliver  duplicates  if  failures  are  re¬ 
ported  inaccurately).  The  merit  of 
solving  such  a  problem  at  a  low  level 
is  that  we  can  then  make  use  of  the 
consistency  properties  of  the  solution 
when  reasoning  about  protocols  that 
react  to  failures. 

Summary  of  issues.  The  previous 
discussion  pointed  to  some  of  the 
potential  pitfalls  that  confront  the 
developer  of  group  software  working 
over  a  conventional  operating  sys¬ 
tem:  (1)  weak  support  for  reliable 
communication,  notably  inconsis¬ 
tency  in  the  situations  in  which  chan¬ 
nels  break,  (2)  group  address  expan¬ 
sion,  (3)  delivery  ordering  for 
concurrent  messages,  (4)  delivery 
ordering  for  sequences  of  related 
messages,  (5)  state  transfers,  and 
(6) failure atomicity. This list is not
exhaustive: we have overlooked
questions involving real-time delivery
guarantees, and persistent databases
and files. However, our work
on ISIS treats process group issues
under the assumption that any real-time
deadlines are long compared to
communication latencies, and that
process states are volatile, hence we
view these issues as beyond the scope
of the current article.5 The list does
cover the major issues that arise in
this more restrictive domain. [5]

Figure 4. Message-ordering problems

Figure 5. Three-round reliable multicast

At the beginning of this section,
we asserted that modern operating
systems lack the tools needed to develop
group-based software. This
assertion goes beyond standards such
as Unix to include next-generation
systems such as NT, Mach, CHORUS
and Amoeba.6 A basic premise of this
article is that, although all of these
problems can be solved, the complexity
associated with working out
the solutions and integrating them in
a single system will be a significant
barrier to application developers.
The only practical approach is to
solve these problems in the distributed
computing environment itself,
or in the operating system. This permits
a solution to be engineered in a
way that will give good, predictable
performance and takes full advantage
of hardware and operating system
features. Furthermore, providing
process groups as an underlying
tool permits the programmer to concentrate
on the problem at hand. If
the implementation of process
groups is left to the application designer,
nonexperts are unlikely to
use the approach. The brokerage
application of the introduction
would be extremely difficult to build
using the tools provided by a conventional
operating system.

5 These issues can be addressed within the tools
layer of ISIS, and in fact the current system includes
an optional subsystem for management
of persistent data.

6 Mach IPC provides strong guarantees of reliability
in its communication subsystem. However,
Mach may experience unbounded delay
when a node failure occurs. CHORUS includes
a port-group mechanism, but with weak semantics,
patterned after earlier work on the V system
[15]. Amoeba, which initially lacked group
support, has recently been extended with a mechanism
apparently motivated by our work on
ISIS [19].

Virtual  Synchrony 

It  was  observed  earlier  in  this  article 
that  integration  of  group  program¬ 
ming  mechanisms  into  a  single  envi¬ 
ronment  is  also  an  important  prob¬ 
lem.  Our  work  addresses  this  issue 
through an execution model called
virtual synchrony, motivated by prior
work on transaction serializability.
We will present the approach in two
stages. First, we discuss an execution
model called close synchrony. This
model  is  then  relaxed  to  arrive  at  the 
virtual  synchrony  model.  A  compari¬ 
son  of  our  work  with  the  serializabil¬ 
ity  model  appears  in  the  section 
“ISIS  and  Other  Distributed  Com¬ 
puting  Technologies.”  The  basic  idea 
is  to  encourage  programmers  to  as¬ 
sume  a  closely  synchronized  style  of 
distributed  execution  [10,  31]: 

•  Execution  of  a  process  consists  of  a 
sequence  of  events,  which  may  be 
internal  computation,  message  trans¬ 
missions,  message  deliveries,  or 
changes  to  the  membership  of 
groups  that  it  creates  or  joins. 

•  A  global  execution  of  the  system 
consists  of  a  set  of  process  execu¬ 
tions.  At  the  global  level,  one  can  talk 
about  messages  sent  as  multicasts  to 
process  groups. 

•  Any  two  processes  that  receive  the 
same  multicasts  or  observe  the  same 
group  membership  changes  see  the 
corresponding  local  events  in  the 
same  relative  order. 

•  A  multicast  to  a  process  group  is 
delivered  to  its  full  membership.  The 
send  and  delivery  events  are  consid¬ 
ered  to  occur  as  a  single,  instanta¬ 
neous  event. 

Close  synchrony  is  a  powerful 
guarantee.  In  fact,  as  seen  in  Figure 
6,  it  eliminates  all  the  problems  iden¬ 
tified  in  the  preceding  section: 

•  Weak  communication  reliability  guar¬ 
antees:  A  closely  synchronous  com¬ 
munication  subsystem  appears  to  the 
programmer  as  completely  reliable. 

•  Group  address  expansion:  In  a  closely 
synchronous  execution,  the  member¬ 
ship  of  a  process  group  is  fixed  at  the 
logical  instant  when  a  multicast  is 
delivered. 

• Delivery ordering for concurrent messages:
In a closely synchronous execution,
concurrently issued multicasts
are distinct events. They would,
therefore,  be  seen  in  the  same  order 
by  any  destinations  they  have  in  com¬ 
mon. 

• Delivery ordering for sequences of related
messages: In Figure 6, part (A),
process s1 sent message m3 after receiving
m2, hence m3 may be causally
dependent on m2. Processes executing
in a closely synchronous model
would never see anything inconsistent
with this causal dependency relation.

• State transfer: State transfer occurs
at a well-defined instant in time in
the model. If a group member
checkpoints the group state at the
instant when a new member is
added, or sends something based on
the state to the new member, the
state will be well defined and complete.

• Failure atomicity: The close synchrony
model treats a multicast as a
single logical event, and reports failures
through group membership
changes that are ordered with respect
to multicast. The all-or-nothing
behavior of an atomic multicast is
thus implied by the model.

ISIS Tools at Process Group Level

Process groups: Create, delete, join (transferring state).

Group multicast: CBCAST, ABCAST, collecting 0, 1, quorum, or all replies (0 replies
gives an asynchronous multicast).

Synchronization: Locking, with symbolic strings to represent locks. Deadlock
detection or avoidance must be addressed at the application level. Token passing.

Replicated data: Implemented by broadcasting updates to group having copies.
Transfer values to processes that join using state transfer facility. Dynamic
system reconfiguration using replicated configuration data. Checkpoint/update
logging, spooling for state recovery after failure.

Monitoring facilities: Watch a process or site, trigger actions after failures and
recoveries. Monitor changes to process group membership, site failures, and
so forth.

Distributed execution facilities: Redundant computation (all take same action).
Subdivided among multiple servers. Coordinator-cohort (primary/backup).

Automated recovery: When a site recovers, programs automatically restart.
For the first site to recover, group state is restored from logs (or initialized by
software). For other sites, a process group join and transfer state is initiated.

WAN communication: Reliable long-haul message passing and file transfer facility.

Unfortunately,  although  closely 
synchronous  execution  simplifies 
distributed  application  design,  the 
approach  cannot  be  applied  directly 
in a practical setting. First, achieving
close synchrony is impossible in the
presence of failures. Say that processes
s1 and s2 are in group G and
message m is multicast to G. Consider
s1 at the instant before it delivers m.
According to the close synchrony
model, it can only deliver m if s2 will
do so also. But s1 has no way to be
sure that s2 is still operational, hence
s1 will be unable to make progress
[36].  Fortunately,  we  can  finesse  this 
issue:  if  s2  has  failed,  it  will  hardly  be 
in  a  position  to  dispute  the  assertion 
that  m  was  delivered  to  it  first! 

A  second  concern  is  that  maintain¬ 
ing  close  synchrony  is  expensive.  The 
simplicity  of  the  approach  stems  in 
part  from  the  fact  that  the  entire 
process  group  advances  in  lockstep. 
But,  this  also  means  that  the  rate  of 
progress  each  group  member  can 
make  is  limited  by  the  speed  of  the 
other  members,  and  this  could  have  a 
huge  performance  impact.  What  is 
needed  is  a  model  with  the  concep¬ 
tual  simplicity  of  close  synchrony,  but 
that  is  capable  of  efficiently  support¬ 
ing  very  high  throughput  applica¬ 
tions. 

Figure 6. Closely synchronous execution

Figure 7. Asynchronous pipelining

In distributed systems, high
throughput comes from asynchronous
interactions: patterns of execution in
which the sender of a message is permitted
to continue executing without
waiting for delivery. An asynchronous
approach treats the communications
system like a bounded buffer,
blocking the sender only when the
rate of data production exceeds the
rate of consumption, or when the
sender needs to wait for a reply or
some other input (Figure 7). The
advantage of this approach is that the
latency (delay) between the sender
and the destination does not affect
the data transmission rate: the system
operates in a pipelined manner,
permitting both the sender and destination
to remain continuously active.
Closely synchronous execution
precludes such pipelining, delaying
execution of the sender until the
message can be delivered.
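
The bounded-buffer view of asynchronous communication can be illustrated with a small Python sketch. It is only an analogy under stated assumptions: a thread stands in for the destination, a standard-library queue stands in for the communication system, and the function name `run_pipeline` is hypothetical.

```python
import queue
import threading


def run_pipeline(n_messages: int, buffer_size: int) -> list:
    """Send n_messages through a bounded channel and return them in arrival order."""
    channel = queue.Queue(maxsize=buffer_size)   # the bounded buffer
    received = []

    def destination():
        while True:
            item = channel.get()
            if item is None:                     # sentinel: no more messages
                break
            received.append(item)

    worker = threading.Thread(target=destination)
    worker.start()
    for i in range(n_messages):
        channel.put(i)        # returns at once unless the buffer is full
    channel.put(None)
    worker.join()
    return received
```

The sender blocks in `put` only when it has run `buffer_size` messages ahead of the destination; otherwise both sides stay busy, which is the pipelining effect described above.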

This  motivates  the  virtual  syn¬ 
chrony  approach.  A  virtually  syn¬ 
chronous  system  permits  asynchro¬ 
nous  executions  for  which  there 
exists  some  closely  synchronous  exe¬ 
cution  indistinguishable  from  the 
asynchronous one. In general, this
means  that  for  each  application, 
events  need  to  be  synchronized  only 
to  the  degree  that  the  application  is 
sensitive  to  event  ordering.  In  some 
situations,  this  approach  will  be  iden¬ 
tical  to  close  synchrony.  In  others,  it 
is  possible  to  deliver  messages  in  dif¬ 
ferent  orders  at  different  processes, 
without  the  application  noticing. 
When  such  a  relaxation  of  order  is 
permissible,  a  more  asynchronous 
execution  results. 

Order sensitivity in distributed systems.
We are led to a final technical
question: “when can synchronization
be relaxed in a virtually synchronous
distributed system?” Two forms of
ordering  turn  out  to  be  useful;  one  is 
“stronger”  than  the  other,  but  also 
more  costly  to  support. 

Consider a system with two processes,
s1 and s2, sending messages
into a group G with members g1 and
g2. s1 sends message m1 to G and, concurrently,
s2 sends m2. In a closely
synchronous system, g1 and g2 would
receive these messages in identical
orders. If, for example, the messages
caused updates to a data structure
replicated within the group, this
property could be used to ensure
that the replicas remain identical
through the execution of the system.
A multicast with this property is said
to achieve an atomic delivery ordering,
and is denoted ABCAST. ABCAST
is an easy primitive to work with, but
costly to implement. This cost stems
from the following consideration: An
ABCAST message can only be delivered
when it is known that no prior
ABCAST remains undelivered. This
introduces latency: messages m1 and
m2 must be delayed before they can
be delivered to g1 and g2. Such a delivery
latency may not be visible to
the application. But, in cases in which
s1 and s2 need responses from g1 and/or
g2, or where the senders and destinations
are the same, the application
will  experience  a  significant  delay 
each  time  an  ABCAST  is  sent.  The 
latencies  involved  can  be  very  high, 
depending  on  how  the  ABCAST 
protocol  is  engineered. 
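
The source of the ABCAST delay can be seen in a small Python sketch. It assumes, purely for illustration, that a single sequencer process assigns a global sequence number to every multicast; this is one well-known way to obtain a total order and is not claimed to be the protocol used in ISIS. The class names are hypothetical.

```python
import itertools


class Sequencer:
    """Assigns consecutive global sequence numbers to multicasts (an assumed design)."""

    def __init__(self):
        self._next = itertools.count()

    def assign(self, body):
        return next(self._next), body


class Receiver:
    def __init__(self):
        self.expected = 0     # sequence number of the next message to deliver
        self.held = {}        # messages that arrived ahead of their turn
        self.delivered = []

    def receive(self, seq, body):
        self.held[seq] = body
        # Deliver only while no earlier message remains outstanding.
        while self.expected in self.held:
            self.delivered.append(self.held.pop(self.expected))
            self.expected += 1


def demo():
    sequencer = Sequencer()
    first = sequencer.assign("m1")
    second = sequencer.assign("m2")
    g1, g2 = Receiver(), Receiver()
    g1.receive(*first)
    g1.receive(*second)
    g2.receive(*second)               # m2 arrives first at g2 and must wait
    waiting = list(g2.delivered)
    g2.receive(*first)
    return g1.delivered, waiting, g2.delivered
```

In `demo`, g2 holds m2 until m1 arrives, so both members deliver m1 and then m2. The wait at g2, together with the extra step needed to obtain a sequence number, is the kind of latency described above.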

Not all applications require such a
strong, costly, delivery ordering.
Concurrent systems often use some
form of synchronization or mutual
exclusion mechanism to ensure that
conflicting operations are performed
in some order. In a parallel shared-memory
environment, this is normally
done using semaphores
around critical sections of code. In a
distributed system, it would normally
be done by using some form of locking
or token passing. Consider such a
distributed system, having the property
that two messages can be sent
concurrently to the same group only
when their effects on the group are independent.
In the preceding example,
either s1 and s2 would be prevented
from sending concurrently (i.e., if m1
and m2 have potentially conflicting
effects on the states of the members
of G), or if they are permitted to send
concurrently, the delivery orders
could be arbitrarily interleaved, because
the actions on receiving such
messages commute.

It might seem that the degree of
delivery ordering needed would be
first-in, first-out (FIFO). However,
this is not quite right, as illustrated in
Figure 8. Here we see a situation in
which s1, holding mutual exclusion,
sends message m1, but then releases
its mutual exclusion lock to s2, which
sends m2. Perhaps m1 and m2 are
updates to the same data item; the
order of delivery could therefore be
quite important. Although there is
certainly a sense in which m1 was sent
“first,” notice that a FIFO delivery
order would not enforce the desired
ordering, since FIFO order is usually
defined for a (sender, destination)
pair, and here we have two senders.
The ordering property needed for
this example is that if m1 causally precedes
m2, then m1 should be delivered
before m2 at shared destinations,
corresponding to a multicast
primitive denoted CBCAST. Notice
that CBCAST is weaker than ABCAST,
because it permits messages
that were sent concurrently to be delivered
to overlapping destinations in
different sequences.7

The  major  advantage  of  CBCAST 
over  ABCAST  is  that  it  is  not  subject 
to  the  type  of  latency  cited  previ¬ 
ously.  A  CBCAST  message  can  be 
delivered  as  soon  as  any  prior  mes¬ 
sages  have  been  delivered,  and  all  the 
information  needed  to  determine 
whether  any  prior  messages  are  out¬ 
standing  can  be  included,  at  low 
overhead,  on  the  CBCAST  message 
itself.  Except  in  unusual  cases  where 
a prior message is somehow delayed
in  the  network,  a  CBCAST  message 
will  be  delivered  immediately  on  re¬ 
ceipt. 

The  ability  to  use  a  protocol  such 
as  CBCAST  is  highly  dependent  on 
the  nature  of  the  application.  Some 
applications  have  a  mutual  exclusion 
structure  for  which  causal  delivery 
ordering  is  adequate,  while  others 
would  need  to  introduce  a  form  of 
locking  to  be  able  to  use  CBCAST 
instead  of  ABCAST.  Basically, 
CBCAST  can  be  used  when  any  con¬ 
flicting  multicasts  are  uniquely  or¬ 
dered  along  a  single  causal  chain.  In 
this  case,  the  CBCAST  guarantee  is 
strong  enough  to  ensure  that  all  the 
conflicting  multicasts  are  seen  in  the 
same  order  by  all  recipients — 
specifically,  the  causal  dependency 
order.  Such  an  execution  system  is 
virtually  synchronous,  since  the  out¬ 
come  of  the  execution  is  the  same  as 
if  an  atomic  delivery  order  had  been 
used. 

The CBCAST communication
pattern arises most often in a process
group that manages replicated (or
coherently cached) data using locks
to order updates. Processes that update
such data first acquire the lock,
then issue a stream of asynchronous
updates, and then release the lock.
There will generally be one update
lock for each class of related data
items, so that acquisition of the update
lock rules out conflicting updates.8
However, mutual exclusion
can sometimes be inferred from
other properties of an algorithm,
hence such a pattern may arise even
without an explicit locking stage. By
using CBCAST for this communication,
an efficient, pipelined data flow
is achieved. In particular, there will
be no need to block the sender of a
multicast, even momentarily, unless
the group membership is changing at
the time the message is sent.

7 The statement that CBCAST is “weaker” than
ABCAST may seem imprecise: as we have
stated the problem, the two protocols simply
provide different forms of ordering. However,
the ISIS version of ABCAST actually extends
the partial CBCAST ordering into a total one: it
is a causal atomic multicast primitive. An argument
can be made that an ABCAST protocol
that is not causal cannot be used asynchronously,
hence we see strong reasons for implementing
ABCAST in this manner.

46 December 1993/Vol.36, No.12

The tremendous performance
advantage of CBCAST over ABCAST
may not be immediately evident.
However, when one considers
how fast modern processors are in
comparison with communication
devices, it should be clear that any
primitive that unnecessarily waits
before delivering a message could
introduce substantial overhead. For
example, it is common for an application
that replicates a table of pending
requests within a group to multicast
each new request, so that all
members can maintain identical copies
of the table. In such cases, if the
way that a request is handled is sensitive
to the contents of the table, the
sender of the multicast must wait
until the multicast is delivered before
acting on the request. Using ABCAST,
the sender will need to wait
until the delivery order can be determined.
Using CBCAST, the update
can be issued asynchronously, and
applied immediately to the copy
maintained by the sender. The
sender thus avoids a potentially long
delay, and can immediately continue
computation or reply to the request.
When a sender generates bursts of
updates, also a common pattern, the
advantage of CBCAST over ABCAST
is even greater, because multiple
messages can be buffered and
sent in one packet, giving a pipelining
effect.

8 In ISIS applications, locks are used primarily
for mutual exclusion on possibly conflicting
operations, such as updates on related data
items. In the case of replicated data, this results
in an algorithm similar to a primary copy update
in which the “primary” copy changes dynamically.
The execution model is nontransactional,
and there is no need for read-locks or for
a two-phase locking rule. This is discussed further
in the section “ISIS and Other Distributed
Computing Technologies.”

The  distinction  between  causal 
and  total  event  orderings  (CBCAST 
and  ABCAST)  has  parallels  in  other 
settings.  Although  ISIS  was  the  first 
distributed  system  to  enforce  a  causal 
delivery  ordering  as  part  of  a  com¬ 
munication  subsystem  [7],  the  ap¬ 
proach  draws  on  Lamport’s  prior 
work  on  logical  notions  of  time. 
Moreover,  the  approach  was  in  some 
respects  anticipated  by  work  on  pri¬ 
mary  copy  replication  in  database 
systems [6]. Similarly, close synchrony
is related both to Lamport
and  Schneider’s  state  machine  approach 
to  developing  distributed  software 
[32]  and  to  the  database  serializability 
model,  to  be  discussed  further.  Work 
on  parallel  processor  architectures 
has  yielded  a  memory  update  model 
called  weak  consistency  [16,  35],  which 
uses  a  causal  dependency  principle  to 
increase  parallelism  in  the  cache  of  a 
parallel  processor.  And,  a  causal  cor¬ 
rectness  property  has  been  used  in 
work  on  lazy  update  in  shared  mem¬ 
ory  multiprocessors  [1]  and  distrib¬ 
uted  database  systems  [18,  20].  A 
more  detailed  discussion  of  the  con¬ 
ditions  under  which  CBCAST  can  be 
used  in  place  of  ABCAST  appears  in 
[10,  31]. 

Summary  of  Benefits  Due  to  Virtual 
Synchrony 

Brevity  precludes  a  more  detailed 
discussion  of  virtual  synchrony,  or 
how  it  is  used  in  developing  distrib¬ 
uted  algorithms  within  ISIS.  It  may 
be  useful,  however,  to  summarize  the 
benefits  of  the  model: 

• Allows code to be developed assuming a simplified, closely
synchronous execution model;

• Supports a meaningful notion of group state and state transfer, both
when groups manage replicated data, and when a computation is
dynamically partitioned among group members;

• Asynchronous, pipelined communication;

• Treatment of communication, process group membership changes and
failures through a single, event-oriented execution model;

• Failure handling through a consistently presented system membership
list integrated with the communication subsystem. This is in contrast
to the usual approach of sensing failures through timeouts and broken
channels, which does not guarantee consistency.

The  approach  also  has  limitations: 

• Reduced availability during LAN partition failures: only allows
progress in a single partition, and hence tolerates at most ⌊n/2⌋ − 1
simultaneous failures, if n is the number of sites currently
operational;

• Risks incorrectly classifying an operational site or process as
faulty.
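
The failure bound in the first limitation can be written out directly.
The following is a minimal sketch in C, not ISIS code: the function name
max_tolerated_failures is hypothetical, and clamping the result to zero
for very small n is an assumption made only so the function never
returns a negative count.

```c
#include <assert.h>

/*
 * Bound quoted in the text: at most floor(n/2) - 1 simultaneous
 * failures, where n is the number of sites currently operational.
 * Hypothetical helper; the clamp to 0 for n < 2 is an assumption.
 */
int max_tolerated_failures(int n)
{
    assert(n >= 0);          /* a site count cannot be negative */
    int bound = n / 2 - 1;   /* integer division is floor(n/2) for n >= 0 */
    return bound > 0 ? bound : 0;
}
```

With n = 10 operational sites, for instance, the bound evaluates to 4.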

The virtual synchrony model is unusual in offering these benefits
within a single framework. Moreover, theoretical arguments exist that
no system that provides consistent distributed behavior can completely
evade these limitations. Our experience has been that the issues
addressed by virtual synchrony are encountered in even the simplest
distributed applications, and that the approach is general, complete,
and theoretically sound.

The  ISIS  Toolkit 

The ISIS toolkit provides a collection of higher-level mechanisms for
forming and managing process groups and implementing group-based
software. This section illustrates the specifics of the approach by
discussing the styles of process groups supported by the system and
giving a simple example of a distributed database application.

ISIS is not the first system to use process groups as a programming
tool: at the time the system was initially developed, Cheriton’s V
system had received wide visibility [15]. More recently, group
mechanisms have become common, exemplified by the Amoeba system [19],
the CHORUS operating system [26], the Psync system [29], a high
availability system developed by Ladin and Liskov [20], IBM’s AAS
system [14], and Transis [3]. Nonetheless, ISIS was first to propose
the virtual synchrony model and to offer high-performance, consistent
solutions to a wide variety of problems through its toolkit. The
approach is now gaining wide acceptance.9

Figure 9. Styles of groups

Styles of Groups

The efficiency of a distributed system is limited by the information
available to the protocols employed for communication. This was a
consideration in developing the ISIS process group interface, in which
a trade-off had to be made between simplicity of the interface and the
availability of accurate information about group membership for use in
multicast address expansion. Consequently, the ISIS application
interface introduces four styles of process groups that differ in how
processes interact with the group, illustrated in Figure 9 (anonymous
groups are not distinguished from explicit groups at this level of the
system). ISIS is optimized to detect and handle each of these cases
efficiently. The four styles of process groups are:

Peer groups: These arise where a set of processes cooperate closely,
for example, to replicate data. The membership is often used as an
input to the algorithm used in handling requests, as for the concurrent
database search described earlier.

Client-server groups: In ISIS, any process can communicate with any
group given the group’s name and appropriate permissions. However, if a
nonmember of a group will multicast to it repeatedly, better
performance is obtained by first registering the sender as a client of
the group; this permits the system to optimize the group addressing
protocol.

Diffusion groups: A diffusion group is a client-server group in which
the clients register themselves but in which the members of the group
send messages to the full client set and the clients are passive sinks.

Hierarchical groups: A hierarchical group is a structure built from
multiple component groups, for reasons of scale. Applications that use
the hierarchical group initially contact its root group, but are
subsequently redirected to one of the constituent “subgroups.” Group
data would normally be partitioned among the subgroups. Although tools
are provided for multicasting to the full membership of the hierarchy,
the most common communication pattern involves interaction between a
client and the members of some subgroup.

9 At the time of this writing our group is working with the Open
Software Foundation on integration of a new version of the technology
into Mach (the OSF 1/AD version) and with Unix International, which
plans a reliable group mechanism for UI Atlas.

There is no requirement that the members of a group be identical, or
even coded in the same language or executed on the same architecture.
Moreover, multiple groups can be overlapped and an individual process
can belong to as many as several hundred different groups, although
this is uncommon. Scaling is discussed later in this article.
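
The redirection step of a hierarchical group, described above, can be
pictured with a small sketch. This is illustrative C only, not the ISIS
interface: the name subgroup_for_key is hypothetical, and the use of a
hash of a data key to partition data among subgroups is an assumption
(the text says only that group data would normally be partitioned).

```c
#include <assert.h>
#include <stddef.h>

/* Simple string hash (djb2 variant); any stable hash would serve. */
static unsigned long hash_key(const char *key)
{
    unsigned long h = 5381;
    while (*key != '\0')
        h = h * 33 + (unsigned char)*key++;
    return h;
}

/*
 * Hypothetical root-group redirect: return the index of the subgroup
 * that holds the partition of the data containing `key`.
 */
size_t subgroup_for_key(const char *key, size_t nsubgroups)
{
    assert(key != NULL && nsubgroups > 0);
    return (size_t)(hash_key(key) % nsubgroups);
}
```

Under this assumption a client would contact the root once, learn the
subgroup index for its key, and then interact only with that subgroup.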

The  Toolkit  Interface 

As noted earlier, the performance of a distributed system is often
limited by the degree of communication pipelining achieved. The
development of asynchronous solutions to distributed problems can be
tricky, and many ISIS users would rather employ less efficient
solutions than risk errors. For this reason, the toolkit includes
asynchronous implementations of the more important distributed
programming paradigms. These include a synchronization tool that
supports a form of locking (based on distributed tokens), a replication
tool for managing replicated data, a tool for fault-tolerant
primary-backup server design that load-balances by making different
group members act as the primary for different requests, and so forth
(a partial list appears in the sidebar “ISIS Tools at a Process Group
Level”). Using these tools, and following programming examples in the
ISIS manual, even nonexperts have been successful in developing
fault-tolerant, highly asynchronous distributed software.
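
The load-balancing idea behind the primary-backup tool can be sketched
as follows. This is a simplified illustration in C rather than the ISIS
tool itself: the name choose_primary, the use of a request number
modulo the view size, and the alive[] array standing in for the
membership view are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pick the rank (index in the membership view) of the member that acts
 * as primary for a request. Different requests start at different
 * ranks, which spreads the load; if the preferred member has failed,
 * the next operational member in rank order takes over as backup.
 * Returns -1 if no member is operational.
 */
int choose_primary(unsigned long request_id, const int alive[], int view_size)
{
    assert(alive != NULL);
    if (view_size <= 0)
        return -1;
    int start = (int)(request_id % (unsigned long)view_size);
    for (int k = 0; k < view_size; k++) {
        int rank = (start + k) % view_size;
        if (alive[rank])
            return rank;
    }
    return -1;
}
```

Because every member evaluates the same rule on the same membership
view, all members would agree on the primary without exchanging
further messages; that agreement is what a consistent view provides.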

Figures 10 and 11 show a complete, fault-tolerant database server for
maintaining a mapping from names (ASCII strings) to salaries
(integers). The example is in the C programming language. The server
initializes ISIS and declares the procedures that will handle update
and inquiry requests. The isis_mainloop dispatches incoming messages to
these procedures as needed (other styles of main loop are also
supported). The formatted-I/O style of message generation and scanning
is specific to the C interface, where type information is not available
at run time.

Figure 10. A simple database server

Figure 11. A client of the simple database service

The “state transfer” routines are concerned with sending the current
contents of the database to a server that has just been started and is
joining the group. In this situation, ISIS arbitrarily selects an
existing server to do a state transfer, invoking its state sending
procedure. Each call that this procedure makes to xfer_out will cause
an invocation of rcv_state on the receiving side; in our example, the
latter simply passes the message to the update procedure (the same
message format is used by send_state and update). Of course, there are
many variants on this basic scheme. For example, it is possible to
indicate to the system that only certain servers should be allowed to
handle state transfer requests, to refuse to allow certain processes to
join, and so forth. The client program does a pg_lookup to find the
server. Subsequently, calls to its query and update procedures are
mapped into messages to the server. The BCAST calls are mapped to the
appropriate default for the group, ABCAST in this case.
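
The state transfer pattern can be simulated without the ISIS library.
The sketch below is self-contained C under stated assumptions: the
replica structure, apply_update, and transfer_state are hypothetical
names, and a direct function call stands in for the xfer_out and
rcv_state message exchange. It shows only the key idea that a joining
server processes each transferred entry exactly as it would an ordinary
update.

```c
#include <assert.h>
#include <string.h>

#define MAX_ENTRIES 16

struct entry   { char name[32]; int salary; };
struct replica { struct entry rows[MAX_ENTRIES]; int count; };

/* Apply one (name, salary) update: overwrite an existing row or append.
   Returns 0 on success, -1 if the table is full. */
int apply_update(struct replica *r, const char *name, int salary)
{
    assert(r != NULL && name != NULL);
    for (int i = 0; i < r->count; i++) {
        if (strcmp(r->rows[i].name, name) == 0) {
            r->rows[i].salary = salary;
            return 0;
        }
    }
    if (r->count == MAX_ENTRIES)
        return -1;
    strncpy(r->rows[r->count].name, name, sizeof r->rows[0].name - 1);
    r->rows[r->count].name[sizeof r->rows[0].name - 1] = '\0';
    r->rows[r->count].salary = salary;
    r->count++;
    return 0;
}

/* State transfer: an existing member replays every entry, and the
   joiner handles each one through the ordinary update path. */
int transfer_state(const struct replica *from, struct replica *joiner)
{
    assert(from != NULL && joiner != NULL);
    for (int i = 0; i < from->count; i++)
        if (apply_update(joiner, from->rows[i].name, from->rows[i].salary) != 0)
            return -1;
    return 0;
}
```

Reusing the update path for incoming state keeps the joiner's table
consistent with the sender's without a second decoding routine, which
mirrors the role the text gives to rcv_state.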

The database server of Figure 10 uses a redundant style of execution in
which the client broadcasts each request and will receive multiple,
identical replies from all copies. In practice, the client will wait
for the first reply and ignore all others. Such an approach provides
the fastest possible reaction to a failure, but has the disadvantage of
consuming n times the resources of a fault-intolerant solution, where n
is the size of the process group. An alternative would have been to
subdivide the search so that each server performs 1/n’th of the work.
Here, the client would combine responses from all the servers,
repeating the request if a server fails instead of replying, a
condition readily detected in ISIS.
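
The alternative of subdividing the search can be sketched as a rule
that each server applies to its own rank in the group. This is an
illustrative C fragment, not ISIS code: search_slice is a hypothetical
name, and dividing a table of `total` records into contiguous
half-open ranges is one assumed way to give each of n servers roughly
1/n of the work.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Compute the half-open range [*begin, *end) of record indices that
 * the server with the given rank should search, out of `total` records
 * shared among n servers. The ranges of ranks 0..n-1 cover every
 * record exactly once.
 */
void search_slice(size_t rank, size_t n, size_t total,
                  size_t *begin, size_t *end)
{
    assert(n > 0 && rank < n && begin != NULL && end != NULL);
    *begin = rank * total / n;
    *end   = (rank + 1) * total / n;
}
```

If a server fails before replying, the client could reissue the request
once the membership has changed; the surviving servers would then
recompute their slices with the smaller n.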

ISIS interfaces have been developed for C, C++, Fortran, Common Lisp,
Ada and Smalltalk, and ports of ISIS exist for Unix workstations and
mainframes from all major vendors, as well as for Mach, CHORUS, ISC and
SCO Unix, the DEC VMS system, and Honeywell’s Lynx operating system.
Data within messages is represented in the binary format used by the
sending machine, and converted to the format of the destination on
receipt (if necessary), automatically and transparently.

    /* Figure 10: the database server */
    #include "isis.h"

    #define UPDATE 1
    #define QUERY  2

    main()
    {
        isis_init(0);
        isis_entry(UPDATE, update, "update");
        isis_entry(QUERY, query, "query");
        pg_join("/demos/salaries", PG_XFER, send_state, 0);
        isis_mainloop(0);
    }

    update(mp)
    register message *mp;
    {
        char name[32];
        int salary;

        msg_get(mp, "%s,%d", name, &salary);
        set_salary(name, salary);
    }

    query(mp)
    register message *mp;
    {
        char name[32];
        int salary;

        msg_get(mp, "%s,%d", name);
        salary = get_salary(name);
        reply(mp, "%d", salary);
    }

    send_state()
    {
        struct sdb_entry *sp;

        for (sp = head(sdb); sp != tail(sdb); sp = sp->s_next)
            xfer_out("%s,%d", sp->s_name, sp->s_salary);
    }

    rcv_state(mp)
    register message *mp;
    {
        update(mp);
    }

    /* Figure 11: a client of the database service */
    #include "isis.h"

    #define UPDATE 1
    #define QUERY  2

    address *server;

    /* Looks up database and registers as a client (for better performance) */
    main()
    {
        isis_init(0);
        server = pg_lookup("/demos/salaries");
        pg_client(server);
    }

    update(name, salary)
    char *name;
    {
        bcast(server, UPDATE, "%s,%d", name, salary, 0);
    }

    get_salary(name)
    char *name;
    {
        int salary;

        bcast(server, QUERY, name, 1, "%d", &salary);
        return(salary);
    }

Who Uses ISIS, and How?

Brokerage

A number of ISIS users are concerned with financial computing systems
such as the one cited at the beginning of this article. Figure 12
illustrates such a system, now seen from an internal perspective in
which groups underlying the services employed by the broker become
evident. A client-server architecture is used, in which the servers
filter and analyze streams of data. Fault-tolerance here refers to two
very different aspects of the application. First, financial systems
must rapidly restart failed components and reorganize themselves so
that service is not interrupted by software or hardware failures.
Second, there are specific system functions that require
fault-tolerance at the level of files or database, such as a guarantee
that after rebooting, a file or database manager will be able to
recover local data files at low cost. ISIS was designed to address the
first type of problem, but includes several tools for solving the
latter one.

The approach generally taken is to represent key services using process
groups, replicating service state information so that even if one
server process fails the others can respond to requests on its behalf.
During periods when n service programs are operational, one can often
exploit the redundancy to improve response time; thus, rather than
asking how much such an application must pay for fault-tolerance, more
appropriate questions concern the level of replication at which the
overhead begins to outweigh the benefits of concurrency, and the
minimum acceptable performance assuming k component failures.
Fault-tolerance is something of a side effect of the replication
approach.

A significant theme in financial computing is the use of a
subscription/publication style. The basic ISIS communication primitives
do not spool messages for future replay; hence an application running
over the system, the NEWS facility, has been developed to support this
functionality.

A final aspect of brokerage systems is that they require a dynamically
varying collection of services. A firm may work with dozens or hundreds
of financial models, predicting market behavior for the financial
instruments being traded under varying market conditions. Only a small
subset of these services will be needed at any time. Thus, systems of
this sort generally consist of a processor pool on which services can
be started as necessary, and this creates a need to support an
automatic remote execution and load balancing mechanism. The
heterogeneity of typical networks complicates this problem, by
introducing a pattern-matching aspect (i.e., certain programs may be
subject to licensing restrictions, or require special processors, or
may simply have been compiled for some specific hardware
configuration). This problem is solved using the ISIS network resource
manager, an application described later.

Database  Replication  and  Triggers 

Although the ISIS computation model differs from a transactional model
(see also the section “ISIS and Other Distributed Computing
Technologies”), ISIS is useful in constructing distributed database
applications. In fact, as many as half of the applications with which
we are familiar are concerned with this problem.

Typical uses of ISIS in database applications focus on replicating a
database for fault-tolerance or to support concurrent searches for
improved performance [2]. In such an architecture, the database system
need not be aware that ISIS is present. Database clients access the
database through a layer of software that multicasts updates (using
ABCAST) to the set of servers, while issuing queries directly to the
least loaded server. The servers are supervised by a process group that
informs clients of load changes in the server pool, and supervises the
restart of a failed server from a checkpoint and log of subsequent
updates. It is interesting to realize that even such an unsophisticated
approach to database replication addresses a widely perceived need
among database users. In the long run, of course, comprehensive support
for applications such as this would require extending ISIS to support a
transactional execution model and to implement the XA/XOpen standards.
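
The routing rule applied by the client-side software layer can be
sketched in a few lines. This is a hypothetical C illustration, not the
ISIS interface: route_request, least_loaded, and the load[] array are
assumed names, and the totally ordered delivery that ABCAST provides
for updates is not modeled here.

```c
#include <assert.h>
#include <stddef.h>

enum request_kind { REQ_UPDATE, REQ_QUERY };

/* Index of the server reporting the smallest load (first one on ties). */
size_t least_loaded(const double load[], size_t nservers)
{
    assert(load != NULL && nservers > 0);
    size_t best = 0;
    for (size_t i = 1; i < nservers; i++)
        if (load[i] < load[best])
            best = i;
    return best;
}

/*
 * Fill targets[] with the servers that should receive a request and
 * return how many there are: every server for an update, only the
 * least loaded server for a query.
 */
size_t route_request(enum request_kind kind, const double load[],
                     size_t nservers, size_t targets[])
{
    assert(targets != NULL);
    if (kind == REQ_QUERY) {
        targets[0] = least_loaded(load, nservers);
        return 1;
    }
    for (size_t i = 0; i < nservers; i++)
        targets[i] = i;
    return nservers;
}
```

Sending updates to every replica in the same order is what keeps the
copies identical, so a query may safely be answered by any one of them.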

Beyond database replication, ISIS users have developed WAN databases by
placing a local database system on each LAN in a WAN system. By
monitoring the update traffic on a LAN, updates of importance to remote
users can be intercepted and distributed through the ISIS WAN
architecture. On each LAN, a server monitors incoming updates and
applies them to the database server as necessary. To avoid a costly
concurrency control problem, developers of applications such as these
normally partition the database so that the data associated with each
LAN is directly updated only from within that LAN. On remote LANs, such
data can only be queried and could be stale, but this is still
sufficient for many applications.

A final use of ISIS in database settings is to implement database
triggers. A trigger is a query that is incrementally evaluated against
the database as updates occur, causing some action immediately if a
specified condition becomes true. For example, a broker might request
that an alarm be sounded if the risk associated with a financial
position exceeds some threshold. As data enters the financial database
maintained by the brokerage, such a query would be evaluated
repeatedly. The role of ISIS is in providing tools for reliably
notifying applications when such a trigger becomes enabled, and for
developing programs capable of taking the desired actions despite
failures.

Major  ISIS-based  Utilities 

In the preceding subsection, we alluded to some of the fault-tolerant
utilities that have been built over ISIS. There are currently five such
systems:

• NEWS: This application supports a collection of communication topics
to which users can subscribe (obtaining a replay of recent postings) or
post messages. Topics are identified with file-system style names, and
it is possible to post to topics on a remote network using a “mail
address” notation; thus, a Swiss brokerage firm might post some quotes
to “/geneva/quotes/ibm@new-york.” The application creates a process
group for each topic, monitoring each such group to maintain a history
of messages posted to it for replay to new subscribers, using a state
transfer when a new member joins.

• NMGR: This program manages batch-style jobs and performs load sharing
in a distributed setting. This involves monitoring candidate machines,
which are collected into a processor pool, and then scheduling jobs on
the pool. A pattern-matching mechanism is used for job placement. If
several machines are suitable for a given job, criteria based on load
and available memory are used to select one (these criteria can readily
be changed). When employed to manage critical system services (as
opposed to running batch-style jobs), the program monitors each service
and automatically restarts failed components. Parallel make is an
example of a distributed application program that uses NMGR for job
placement: it compiles applications by farming out compilation subtasks
to compatible machines.

• DECEIT: This system [33] provides fault-tolerant NFS-compatible file
storage. Files are replicated both to increase performance (by
supporting parallel reads on different replicas) and for fault
tolerance. The level of replication is varied depending on the style of
access detected by the system at run time. After a failed node
recovers, any files it managed are automatically brought up to date.
The approach conceals file replication from the user, who sees an
NFS-compatible file-system interface.

• META/LOMITA: META is an extensive system for building fault-tolerant
reactive control applications [24, 37]. It consists of a layer for
instrumenting a distributed application or environment, by defining
sensors and actuators. A sensor is any typed value that can be polled
or monitored by the system; an actuator is any entity capable of taking
an action on request. Built-in sensors include the load on a machine,
the status of software and hardware components of the system, and the
set of users on each machine. User-defined sensors and actuators extend
this initial set.

The “raw” sensors and actuators of the lowest layer are mapped to
abstract sensors by an intermediate layer, which also supports a simple
database-style interface and a triggering facility. This layer supports
an entity-relation data model and conceals many of the details of the
physical sensors, such as polling frequency and fault tolerance.
Sensors can be aggregated, for example by taking the average load on
the servers that manage a replicated database. The interface supports a
simple trigger language that will initiate a prespecified action when a
specified condition is detected.

Running over META is a distributed language for specifying control
actions in high-level terms, called LOMITA. LOMITA code is embedded
into the Unix CSH command interpreter. At run time, LOMITA control
statements are expanded into distributed finite state machines
triggered by events that can be sensed local to a sensor or system
component; a process group is used to implement aggregates, perform
these state transitions, and to notify applications when a monitored
condition arises.


Figure 12. Process group architecture of brokerage system


• SPOOLER/LONG-HAUL FACILITY: This subsystem is responsible for
wide-area communication [23] and for saving messages to groups that are
only active periodically. It conceals link failures and presents an
exactly-once communication interface.
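
The “mail address” notation used by NEWS in the list above separates a
topic name from a remote network name. As an illustration only, the
following C sketch splits such an address at the “@” character; the
function split_topic_address is hypothetical, and treating an address
without “@” as a local topic is an assumption, since the text does not
describe how NEWS parses addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Split "topic@network" into its two parts. If there is no '@', the
 * whole string is the topic and the network is the empty string
 * (assumed here to mean the local network). Returns 0 on success, or
 * -1 if either part does not fit in its buffer.
 */
int split_topic_address(const char *addr,
                        char *topic, size_t topic_sz,
                        char *network, size_t network_sz)
{
    assert(addr != NULL && topic != NULL && network != NULL);
    const char *at = strrchr(addr, '@');
    size_t topic_len = at ? (size_t)(at - addr) : strlen(addr);
    const char *net = at ? at + 1 : "";

    if (topic_len >= topic_sz || strlen(net) >= network_sz)
        return -1;
    memcpy(topic, addr, topic_len);
    topic[topic_len] = '\0';
    strcpy(network, net);
    return 0;
}
```

Applied to the address quoted in the text, this yields the topic
/geneva/quotes/ibm and the network new-york.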

Other  ISIS  Applications 

Although this section covered a variety of ISIS applications, brevity
precludes a systematic review of the full range of software that has
been developed over the system. In addition to the problems cited, ISIS
has been applied to telecommunications switching and “intelligent
networking” applications, military systems, such as a proposed
replacement for the AEGIS aircraft tracking and combat engagement
system, medical systems, graphics and virtual reality applications,
seismology, factory automation and production control, reliable
management and resource scheduling for shared computing facilities, and
a wide-area weather prediction and storm tracking system [2, 17, 35].
ISIS has also proved popular for scientific computing at laboratories
such as CERN and Los Alamos, and has been applied to such problems as a
programming environment for automatically introducing parallelism into
data-flow applications [4], a beam focusing system for a particle
accelerator, a weather simulation that combines a highly parallel ocean
model with a vectorized atmospheric model and displays output on
advanced graphics workstations, and resource management software for
shared supercomputing facilities.

It should also be noted that although this article has focused on LAN
issues, ISIS also supports a WAN architecture and has been used in WANs
composed of up to 10 LANs.10 Many of the applications cited are
structured as LAN solutions interconnected by a reliable, but less
responsive, WAN layer.

ISIS and Other Distributed Computing Technologies

Our discussion has overlooked the types of real-time issues that arise
in the Advanced Automation System, a next-generation air-traffic
control system being developed by IBM for the FAA [13, 14], which also
uses a process-group-based computing model. Similarly, one might wonder
how the ISIS execution model compares with transactional database
execution models. Unfortunately, these are complex issues, and it would
be difficult to do justice to them without a lengthy digression.
Briefly, a technology like the one used in AAS differs from ISIS in
offering strong real-time guarantees, provided that timing assumptions
are respected. This is done by measuring timing properties of the
network, hardware, and scheduler on which the system runs and designing
protocols to have highly predictable behavior. Given such information
about the environment, one could undertake a similar analysis of the
ISIS protocols, although we have not done so. As noted earlier,
experience suggests that ISIS is fast enough for even very demanding
applications.11

10 The WAN architecture of ISIS is similar to the LAN structure, but
because WAN partitions are more common, encourages a more asynchronous
programming style. WAN communication and link state are logged to disk
files (unlike LAN communication), which enables ISIS to retransmit
messages lost when a WAN partition occurs and to suppress duplicate
messages. WAN issues are discussed in more detail in [23].

The relationship between ISIS and transactional systems originates in
the fact that both virtual synchrony and transactional serializability
are order-based execution models [6]. However, where the “tools”
offered by a database system focus on isolation of concurrent
transactions from one another, persistent data and rollback (abort)
mechanisms, those offered in ISIS are concerned with direct cooperation
between members of groups, failure handling, and ensuring that a system
can dynamically reconfigure itself to make forward progress when
partial failures occur. Persistency of data is a big issue in database
systems, but much less so in ISIS. For example, the commit problem is a
form of reliable multicast, but a commit implies serializability and
permanence of the transaction being committed, while delivery of a
multicast in ISIS provides much weaker guarantees.

Conclusions 

We have argued that the next generation of distributed computing
systems will benefit from support for process groups and group
programming. Arriving at an appropriate semantics for a process group
mechanism is a difficult problem, and implementing those semantics
would exceed the abilities of many distributed application developers.
Either the operating system must implement these mechanisms or the
reliability and performance of group-structured applications are
unlikely to be acceptable.

The ISIS system provides tools for programming with process groups. A
review of research on the system leads us to the following conclusions:

• Process groups should embody strong semantics for group membership,
communication, and synchronization. A simple and powerful model can be
based on closely synchronized distributed execution, but high
performance requires a more asynchronous style of execution in which
communication is heavily pipelined. The virtual synchrony approach
combines these benefits, using a closely synchronous execution model,
but deriving a substantial performance benefit when message ordering
can safely be relaxed.

• Efficient protocols have been developed for supporting virtual
synchrony.

• Nonexperts find the resulting system relatively easy to use.

11 A process that experiences a timing fault in the protocols on which
the AAS was originally based could receive messages that other
processes reject, or reject messages others accept, because the
criteria for accepting or rejecting a message use the value of the
local clock [13]. This can lead to consistency violations. Moreover, if
a fault is transient (e.g., the clock is subsequently resynchronized
with other clocks), the inconsistency of such a process could spread if
it initiates new multicasts, which other processes will accept.
However, this problem can be overcome by changing the protocol, and the
author understands this to have been done as part of the implementation
of the AAS system.

This article was written as the first phase of the ISIS effort
approached conclusion. We feel the initial system has demonstrated the
feasibility of a new style of distributed computing. As reported in
[11], ISIS achieves levels of performance comparable to those afforded
by standard technologies (RPC and streams) on the same platforms.
Looking to the future, we are now developing an ISIS “microkernel”
suitable for integration into next-generation operating systems such as
Mach, NT, and CHORUS. This new system will also incorporate a security
architecture [26] and a real-time communication suite. The programming
model, however, will be unchanged.

Process group programming could ignite a wave of advances in reliable
distributed computing, and of applications that operate on distributed
platforms. Using current technologies, it is impractical for typical
developers to implement high-reliability software or self-managing
distributed systems, to employ replicated data or simple coarse-grained
parallelism, or to develop software that reconfigures automatically
after a failure or recovery. Consequently, although current networks
embody tremendously powerful computing resources, the programmers who
develop software for these environments are severely constrained by a
deficient software infrastructure. Removing these unnecessary obstacles
could unleash a vast groundswell of reliable distributed application
development.